Reproducibility in RNA-Seq Analysis with R

Advanced R-course 2025

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2025-11-21

What Is Reproducibility?

Reproducibility means that someone else (or future you) can run your code and obtain the same results, given the same input data.

Goal:
Make your analysis transparent, traceable, and re-runnable.

Why Reproducibility Matters in RNA-Seq

  • Ensures scientific integrity and verifiability
  • Facilitates collaboration across teams and institutions
  • Simplifies debugging and pipeline maintenance
  • Supports long-term sustainability of RNA-Seq workflows
  • Complies with data sharing and publication standards

Key Components



Component             Tool                Purpose
Directory structure   Organized folders   Clarity & modularity
Project management    .Rproj              Stable working directory
Environment control   renv                Package reproducibility
Version control       Git + GitHub        Collaboration & history
Containerization      Docker              Full environment capture

1. Maintaining Directory Structure (RNA-Seq Projects)

Recommended RNA-Seq Project Layout:

RNA-Seq_ProjectName/
├── data/
│ ├── raw_data/                 # FASTQ or BAM files (read-only)
│ ├── reference_data/           # Reference genome, GTF, annotations
│ ├── meta_data/                # Sample sheets and other metadata (CSV, TSV)
│ └── processed_data/           # Derived data
│   ├── trimmed_data/           # Adapter-trimmed FASTQ files
│   ├── alignments_data/        # Alignment outputs (BAM/SAM)
│   └── counts_data/            # Gene/transcript counts
│
├── results/
│ ├── qc/                       # FastQC, MultiQC reports
│ ├── differential_expression/  # DESeq2, edgeR, limma results
│ ├── functional_profiling/     # GO, KEGG enrichment
│ └── final_figures/            # Publication-ready plots
│
├── reports/                    # Quarto or RMarkdown reports
├── scripts/                    # R or shell scripts
├── R/                          # Custom R functions
├── logs/                       # Pipeline logs
└── README.md
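
The layout above can be created from R so that every new project starts from the same skeleton. The following is a minimal base-R sketch, not a prescribed setup script: `project_root` is a hypothetical location (a temporary directory here, so nothing real is touched), and the folder names are copied from the tree above.

```r
# Sketch: create the project skeleton shown above (base R only).
# `project_root` is a hypothetical location; a temporary directory is used
# here so the example does not touch real data.
project_root <- file.path(tempdir(), "RNA-Seq_ProjectName")

project_dirs <- c(
  "data/raw_data",
  "data/reference_data",
  "data/meta_data",
  "data/processed_data/trimmed_data",
  "data/processed_data/alignments_data",
  "data/processed_data/counts_data",
  "results/qc",
  "results/differential_expression",
  "results/functional_profiling",
  "results/final_figures",
  "reports",
  "scripts",
  "R",
  "logs"
)

# recursive = TRUE also creates missing parent folders;
# showWarnings = FALSE keeps re-runs quiet when folders already exist.
for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Start a README that will document the purpose of each folder.
readme <- file.path(project_root, "README.md")
if (!file.exists(readme)) {
  writeLines("# RNA-Seq_ProjectName", readme)
}
```

Because existing folders are left alone, the script can be re-run safely after adding a new entry to `project_dirs`.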

Tip

  • Keep data/raw_data/ read-only.
  • Separate code, data, and results.
  • Use here::here() or fs::path() for reproducible paths.
  • Document folder purpose in README.md.
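
As an illustration of the first and third tips, the sketch below builds a path from its components and then removes write permission from a raw file. It uses only base R; `project_root`, the file name, and the FASTQ content are placeholders, and the permission step assumes a Unix-like file system.

```r
# Sketch (base R): build paths from components and protect raw data.
# `project_root`, the file name, and the FASTQ content are placeholders.
project_root <- file.path(tempdir(), "path_demo")
raw_dir <- file.path(project_root, "data", "raw_data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# file.path() joins components with "/" on every platform, so the same
# script runs unchanged on Windows, macOS, and Linux.
fastq <- file.path(raw_dir, "sample1_R1.fastq")
if (!file.exists(fastq)) {
  writeLines(c("@read1", "ACGT", "+", "IIII"), fastq)
}

# Remove write permission from the raw file (r--r--r--).
# Assumption: a Unix-like file system; on Windows Sys.chmod() can only
# toggle the read-only flag.
Sys.chmod(fastq, mode = "0444", use_umask = FALSE)

# With the here package installed and the project opened via its .Rproj,
# the equivalent project-relative path would be:
#   here::here("data", "raw_data", "sample1_R1.fastq")
```

Building paths from components avoids hard-coded absolute paths such as a personal home directory, which would break on a collaborator's machine.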

2. Creating an R Project

  • In RStudio, click File → New Project
  • Select “New Directory”
  • Select “New Project” as the project type
  • Enter a directory name and choose where to create it
  • Optionally
    • check “Create a git repository”
    • check “Use renv with this project”
  • Click “Create Project”

Tip

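  • The .Rproj file marks the project root
  • Opening the project sets the working directory to that root, so relative paths resolve the same way on every machine
  • RStudio projects integrate with Git, renv, and Quarto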
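
To make the idea of a project root concrete, the sketch below walks up the directory tree until it finds a folder containing an .Rproj file. RStudio handles this implicitly when you open the project, and the here package searches for root markers in a similar spirit; `find_project_root()` is a hypothetical helper written only for illustration, and the demo folder names are placeholders.

```r
# Hypothetical helper: walk up from `start` until a folder containing an
# .Rproj file is found; that folder is treated as the project root.
find_project_root <- function(start = getwd()) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(current, pattern = "\\.Rproj$")) > 0) {
      return(current)
    }
    parent <- dirname(current)
    if (identical(parent, current)) {
      # Reached the file system root without finding a marker.
      stop("No .Rproj file found at or above ", start)
    }
    current <- parent
  }
}

# Demo in a temporary folder (names are placeholders).
demo_root <- file.path(tempdir(), "demo_project")
dir.create(file.path(demo_root, "scripts"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "demo_project.Rproj"))

# Starting from a subfolder still resolves to the project root.
root <- find_project_root(file.path(demo_root, "scripts"))
basename(root)
```

A script that resolves the root this way behaves the same whether it is launched from the project folder or from one of its subfolders.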

3. Managing Dependencies with renv

Your analysis depends on the exact versions of packages used.
renv captures and restores them easily.


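install.packages("renv")  # Once per machine
renv::init()              # Once per project: set up the project library and renv.lock
renv::snapshot()          # After installing or updating packages: record versions in renv.lock
renv::restore()           # On another machine or a fresh clone: reinstall the recorded versions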


Tip

  • Lets collaborators restore the same package versions from renv.lock
  • Avoids “it worked on my machine” problems
  • Works with Git: commit renv.lock so the environment is versioned together with the code
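
Among other fields, renv.lock stores the name and version of each package the project uses. To show what "exact versions" means in practice, the base-R sketch below lists the version of every package loaded in the current session. `loaded_package_versions()` is a hypothetical helper for illustration only; it does not replace `renv::snapshot()`.

```r
# Hypothetical helper: collect name and version of every package that is
# currently loaded in the session (base R only).
loaded_package_versions <- function() {
  pkgs <- sort(loadedNamespaces())
  data.frame(
    package = pkgs,
    version = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

versions <- loaded_package_versions()
head(versions)

# The R version itself matters too; record it next to the package versions.
r_version <- as.character(getRversion())
```

Writing such a table into a report or the logs/ folder gives readers a quick, human-readable record alongside the lockfile.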

4. Version Control with Git & GitHub

Why Git?

  • Track every change in scripts and notebooks
  • Roll back easily to any commit
  • Collaborate efficiently with others
  • Work from RStudio’s built-in Git pane

Setting Up Git in RStudio

  1. Enable Git:
    Tools → Global Options → Git/SVN → Enable version control interface for RStudio projects
  2. Initialize Git in your project:
    usethis::use_git()
  3. Use RStudio’s Git pane to:
    Stage changes → Commit with a message → Push to GitHub (once the remote is set up in step 5)
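
Before the first commit, it usually helps to tell Git which files to leave out, since FASTQ and BAM files are too large for a normal repository. The sketch below writes a .gitignore from R. The ignore patterns are assumptions based on the directory layout in section 1, `add_to_gitignore()` is a hypothetical helper, and the file is written to a temporary directory for demonstration.

```r
# Sketch (base R): write a .gitignore so large or derived files stay out of Git.
# The patterns are assumptions based on the layout in section 1; adjust them
# to your project.
ignore_patterns <- c(
  "# large inputs and derived data",
  "data/raw_data/",
  "data/processed_data/",
  "# R session files",
  ".Rhistory",
  ".RData",
  ".Rproj.user/"
)

# Hypothetical helper: add only the patterns that are not already listed.
add_to_gitignore <- function(patterns, path = ".gitignore") {
  existing <- if (file.exists(path)) readLines(path) else character(0)
  new <- setdiff(patterns, existing)
  if (length(new) > 0) {
    writeLines(c(existing, new), path)
  }
  invisible(new)
}

gitignore <- file.path(tempdir(), ".gitignore")  # demo location
add_to_gitignore(ignore_patterns, gitignore)
add_to_gitignore(ignore_patterns, gitignore)     # second call adds nothing
readLines(gitignore)
```

In a real project the `path` argument would point at the project root, so the file sits next to the .Rproj file.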

GitHub Authentication (Required Step)

Using Personal Access Token (PAT)

In GitHub:

Settings → Developer Settings → Personal Access Tokens → Tokens (Classic)

Generate a new token with scopes:

repo, user, workflow

Copy the token right away, because GitHub shows it only once, and keep it private.

In R:

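# Alternative to the manual steps above: opens the GitHub token page
# in your browser with recommended scopes pre-selected
usethis::create_github_token()
# Prompts for the PAT and saves it in the Git credential store
gitcreds::gitcreds_set()
# Paste your PAT when prompted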

5. Connecting RStudio to GitHub

Once authenticated:

usethis::use_github()

This will:

  • Create a GitHub repository
  • Link your local repo (set remote origin)
  • Push your existing commits and set the upstream branch

RNA-Seq Reproducibility Workflow

graph TD
    A[Start New RNA-Seq Project] --> B[Set Up Directory Structure]
    B --> C[Create R Project]
    C --> D[Manage Dependencies with renv]
    D --> E[Set Up Git Version Control]
    E --> F[Connect to GitHub Repository]
    F --> G[Develop RNA-Seq Analysis Scripts]
    G --> H[Commit & Push Changes Regularly]